Final Report on Brooklyn Housing Dataset (2003-2017)

Class: Data Science 1 with R (STAT 301-1)

Author

Brian Dinh

Published

November 27, 2023

1 Introduction

For this report, I explored a dataset labelled “Brooklyn Home Sales 2003 to 2017,” which describes all buildings, residential and nonresidential, sold in the New York borough of Brooklyn. The data comes from the government of the state of New York, and links to the data can be found under References (Section 4). Additionally, the cleaned dataset was too large to upload to GitHub, so I have added a Google Drive link to it here.

For this exploratory data analysis, I was primarily motivated by my curiosity about what drives building sales in the Brooklyn market, especially since buildings in this area tend to be more expensive than in the rest of the United States due to the borough's urbanity and its access to New York City as a whole. Additionally, I wished to look at what factors could have affected housing price increases in the area. Lastly, I wanted to look at which areas in Brooklyn are the most desired, based on sales volume in this dataset, and which neighborhoods have become less popular.

2 Data Overview and Quality

For the original “Brooklyn Home Sales 2003 to 2017” dataset, there are 111 variables and 390,883 observations. There are 32 categorical variables, 71 numerical variables, 7 logical variables, and 1 date variable.

There are missingness issues for many columns associated with geographic mapping information, borough data (e.g. what borough the building is in), and some building information (like total units, building stories, etc.). Due to this missingness, I will not be able to construct a geographic visualization of the data using a dashboard package like Shiny and may be limited in my analysis of nonresidential buildings, as thorough building information could affect nonresidential prices. I do not believe the missing borough data will affect my analysis because all of the buildings sold are located in Brooklyn.
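The missingness check described above can be sketched as a per-column share of missing values. This is a minimal Python sketch using only the standard library; the rows and the column names `borough`, `total_units`, and `num_stories` are hypothetical placeholders, not the dataset's actual values.

```python
# Hypothetical rows; None marks a missing value.
# Column names are illustrative placeholders, not taken from the dataset.
rows = [
    {"borough": None, "total_units": 2, "num_stories": 3},
    {"borough": None, "total_units": None, "num_stories": 2},
    {"borough": "BK", "total_units": 1, "num_stories": None},
    {"borough": None, "total_units": 4, "num_stories": 2},
]


def missing_share(records):
    """Return the fraction of missing (None) values for each column."""
    n = len(records)
    columns = records[0].keys()
    return {col: sum(r[col] is None for r in records) / n for col in columns}


shares = missing_share(rows)

# List columns from most to least missing.
for col, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{col}: {share:.0%} missing")
```

Columns with a high missing share (here the made-up `borough` column) are the ones that would be dropped or set aside during cleaning.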

The dataset is not in the GitHub repository for this final project because it is too large (207.6 MB) to commit, so please refer to the hyperlink above in the Introduction (Section 1) for access to the original dataset and the cleaned dataset.

The cleaned dataset has 390,833 observations and 81 columns. There are 19 categorical variables, 61 numeric variables, and 1 date variable.

3 Explorations

3.1 How many buildings are being sold in Brooklyn over time?

Before exploring this question, I wanted to check the different types of buildings being sold in Brooklyn, because my initial hypothesis is that building type can greatly affect other attributes of a sale, such as price and square footage. The variable tax_class_at_sale categorizes the type of building sold into four categories according to the state of New York:

  • (Class 1): Includes most residential property of up to three units (such as one-, two-, and three-family homes and small stores or offices with one or two attached apartments), vacant land that is zoned for residential use, and most condominiums that are not more than three stories.
  • (Class 2): Includes all other property that is primarily residential, such as cooperatives and condominiums.
  • (Class 3): Includes property with equipment owned by a gas, telephone or electric company.
  • (Class 4): Includes all other properties not included in classes 1, 2, and 3, such as offices, factories, warehouses, garage buildings, etc.

Figure 1: Distributions of Buildings Sold in Brooklyn by Tax Classes

According to Figure 1 (a), the most actively sold buildings are primarily residential buildings in classes 1 and 2, with class 4 buildings coming third and class 3 buildings rarely being sold. Additionally, class 4 buildings appear to cause large skews in both price and square feet: according to Figure 1 (b) and Figure 1 (c), the majority of larger and more expensive buildings were class 4.

Thus, I will separate my dataset into residential (class 1 and class 2) buildings and non-residential (class 3 and class 4) buildings, with the majority of my analysis focusing on the residential dataset.
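The split described above can be sketched as a simple partition on `tax_class_at_sale`. This is a minimal Python sketch with hypothetical rows; the `sale_price` column name and all values are assumptions for illustration only.

```python
# Tax classes treated as residential, following the split described in the text.
RESIDENTIAL_CLASSES = {1, 2}

# Hypothetical sales records; only tax_class_at_sale is a name from the report.
sales = [
    {"tax_class_at_sale": 1, "sale_price": 450_000},
    {"tax_class_at_sale": 2, "sale_price": 620_000},
    {"tax_class_at_sale": 4, "sale_price": 3_500_000},
    {"tax_class_at_sale": 1, "sale_price": 380_000},
    {"tax_class_at_sale": 3, "sale_price": 900_000},
]


def split_by_tax_class(records):
    """Partition records into (residential, non_residential) lists."""
    residential = [
        r for r in records if r["tax_class_at_sale"] in RESIDENTIAL_CLASSES
    ]
    non_residential = [
        r for r in records if r["tax_class_at_sale"] not in RESIDENTIAL_CLASSES
    ]
    return residential, non_residential


residential, non_residential = split_by_tax_class(sales)
print(len(residential), "residential;", len(non_residential), "non-residential")
```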

After separating the dataset, I wanted to look at how the number of building sales has changed year by year, to see whether there are any periods of downturn or of relatively high sales counts.

Figure 2: Distribution of Brooklyn Building Sales by Year

In both Figure 2 (a) and Figure 2 (b), sales in Brooklyn were highest from 2003 to 2006, followed by a significant dip from 2007 to 2010 that was stronger for residential buildings than for non-residential buildings. From 2011 to 2017, building sales appear to have recovered. It is worth checking whether these three time periods show the same trends. For example, one immediate question I had was whether the decrease in sales coincided with an increase or a decrease in prices in the building market. To analyze this, I made the following line plots, which show the average sale price of buildings sold in Brooklyn, grouped by year.
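The yearly counts and the three time periods discussed above can be sketched as follows. This is a minimal Python sketch; the list of sale years is made up for illustration, and the helper `period_of` is a hypothetical name.

```python
from collections import Counter


def period_of(year):
    """Map a sale year to one of the three periods discussed in the text."""
    if 2003 <= year <= 2006:
        return "2003-2006"
    if 2007 <= year <= 2010:
        return "2007-2010"
    if 2011 <= year <= 2017:
        return "2011-2017"
    raise ValueError(f"year {year} is outside the 2003-2017 range")


# Hypothetical sale years, one entry per sale.
sale_years = [2003, 2003, 2004, 2005, 2006, 2008, 2009, 2011, 2012, 2015, 2017]

sales_per_year = Counter(sale_years)
sales_per_period = Counter(period_of(y) for y in sale_years)

for period in ("2003-2006", "2007-2010", "2011-2017"):
    print(period, sales_per_period[period])
```

Because the periods differ in length, dividing each count by the number of years in the period gives a fairer comparison than the raw totals.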

Figure 3: Average Building Sale Prices by Year in Brooklyn

Looking at Figure 3 (a) and Figure 3 (b), the average sale price of buildings in Brooklyn actually decreased in the 2007-2010 range, which means that my hypothesis that fewer sales would coincide with higher prices was incorrect. To investigate further, let us look at the price ranges of the buildings sold during the three time periods of 2003-2006, 2007-2010, and 2011-2017.
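The yearly averages behind these line plots amount to a group-by-year mean. This is a minimal Python sketch with hypothetical `(year, price)` pairs, not values from the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (sale year, sale price) pairs.
sales = [
    (2006, 500_000),
    (2006, 700_000),
    (2008, 400_000),
    (2008, 500_000),
    (2014, 900_000),
]


def mean_price_by_year(records):
    """Group prices by year and return {year: mean price}, sorted by year."""
    prices = defaultdict(list)
    for year, price in records:
        prices[year].append(price)
    return {year: mean(values) for year, values in sorted(prices.items())}


yearly_means = mean_price_by_year(sales)
for year, avg in yearly_means.items():
    print(year, avg)
```

Since a few very expensive sales can pull the mean upward, computing the median alongside it (with `statistics.median`) is a reasonable robustness check.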

Figure 4: Distribution of Count of Residential Buildings Sold in Brooklyn by Price

In Figure 4 (a), Figure 4 (b), and Figure 4 (c), the most common selling price in Brooklyn falls in the $300k to $500k range, and this is most prominent in Figure 4 (a). However, in Figure 4 (b), the number of homes sold slightly above the $300k to $500k range decreased significantly, which could have caused the decrease in average price for this time period. In the 2011 to 2017 range, there was a more subtle increase in properties sold in the $700k to $1m range, even though buildings in the $300k to $500k range remained the most common, which could explain the large increases in average residential building prices over time.
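The price ranges compared above can be sketched as a binning step followed by a count per band. This is a minimal Python sketch; the band edges are assumptions chosen to match the ranges mentioned in the text, and the prices are made up.

```python
from bisect import bisect_right
from collections import Counter

# Assumed band edges in dollars; each band includes its lower edge
# and excludes its upper edge.
EDGES = [300_000, 500_000, 700_000, 1_000_000]
LABELS = ["<$300k", "$300k-$500k", "$500k-$700k", "$700k-$1m", ">=$1m"]


def price_band(price):
    """Return the label of the band that contains the given price."""
    return LABELS[bisect_right(EDGES, price)]


# Hypothetical sale prices.
prices = [250_000, 320_000, 450_000, 480_000, 650_000, 800_000, 1_200_000]

band_counts = Counter(price_band(p) for p in prices)
for label in LABELS:
    print(label, band_counts[label])
```

Running the same count separately for each time period gives the per-period distributions that the figure panels compare.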


3.2 What affects the price for buildings?

For the next part of my EDA, I wanted to look at what affects the price of buildings in Brooklyn. First off, I wanted to analyze the relationship between square feet and price using a scatter plot.

Figure 5: Brooklyn Residential Building Sale Price compared to Gross Square Ft

In Figure 5, the fitted linear trend line shows a generally positive relationship between price and square feet: the more square feet a property has, the more expensive it tends to be, which was expected.
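A linear trend line of this kind can be sketched with ordinary least squares. This is a minimal Python sketch that assumes the plotted line is a least-squares fit of price on gross square feet; the data points are hypothetical and chosen to lie exactly on a line so the result is easy to verify.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical gross square feet and sale prices.
gross_sqft = [1000, 1500, 2000, 2500]
sale_price = [400_000, 550_000, 700_000, 850_000]

slope, intercept = fit_line(gross_sqft, sale_price)
print(f"price = {slope:.0f} * sqft + {intercept:.0f}")
```

A positive slope corresponds to the positive relationship read off the scatter plot; with real data the points would scatter around the line instead of falling on it.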

Another variable I wanted to look at was prox_code, which identifies the proximity of the property to another property. prox_code is split into three categories: detached, semi-attached, and attached.

Figure 6: Distribution of Residential Buildings by Proximity Code

(a) Yearly Average Residential Building Prices in Brooklyn, Split by Proximity Codes

(b) Count of Residential Buildings Sold in Brooklyn from 2003-2017

Figure 6 (a) reveals that, on average, detached and attached buildings sold at around the same price points from 2003 to 2017, while semi-attached buildings tended to be cheaper than other properties. However, in Figure 6 (b), semi-attached buildings are the second most sold, with attached buildings being the most sold and detached buildings the least. This may indicate that the lower prices of semi-attached properties lead to more sales or transfers of these properties.
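The comparison across proximity codes combines a count and an average price per `prox_code` group. This is a minimal Python sketch with hypothetical records; the `sale_price` column name and all values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sales records; prox_code categories follow the text.
sales = [
    {"prox_code": "attached", "sale_price": 800_000},
    {"prox_code": "attached", "sale_price": 600_000},
    {"prox_code": "attached", "sale_price": 700_000},
    {"prox_code": "semi-attached", "sale_price": 500_000},
    {"prox_code": "semi-attached", "sale_price": 400_000},
    {"prox_code": "detached", "sale_price": 700_000},
]


def summarize_by_prox_code(records):
    """Return {prox_code: {"count": n, "mean_price": average price}}."""
    grouped = defaultdict(list)
    for r in records:
        grouped[r["prox_code"]].append(r["sale_price"])
    return {
        code: {"count": len(prices), "mean_price": mean(prices)}
        for code, prices in grouped.items()
    }


summary = summarize_by_prox_code(sales)
for code, stats in summary.items():
    print(code, stats["count"], stats["mean_price"])
```

Grouping by both year and `prox_code` in the same way would give the yearly averages shown in panel (a).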

4 References